Fast Multi-Match Lempel-Ziv
Authors
Abstract
One of the most popular encoders in the literature is LZ78, proposed by Ziv and Lempel in 1978. After the original paper was published, many variants of LZ78 were proposed to improve its performance. Simulation results have shown that the well-known LZW variant achieves an improvement of around 10%. Given a sequence $u$ over the source alphabet $A$, all versions of Lempel-Ziv parse the sequence into blocks $b_i$ such that $u = b_1^k = b_1 b_2 \cdots b_k$, and then encode each block $b_i$ using the previous blocks $b_1^{i-1}$. Each step of a Lempel-Ziv encoder can be summarized as a variable-length to fixed-length code, which maps a variable-length block (belonging to a dictionary $D_{i-1}$ produced from $b_1^{i-1}$) to a fixed-length codeword (of length $\lceil \log |D_{i-1}| \rceil$). Therefore, at each step, the best strategy (to maximize the compression rate) is to find the longest word (match) $b_i \in D_{i-1}$. However, this may no longer be true if the encoder is allowed to look ahead, that is, if the encoding is regarded, for instance, as a two-steps-at-a-time procedure. In other words, the longest double match can be longer than the longest match followed by another longest match. In [1], Finamore et al. proposed a new parsing rule for the Lempel-Ziv algorithm, giving rise to new variations of the algorithm (generally designated Multi-match Lempel-Ziv, mmLZ). Instead of searching for the longest match, mmLZ seeks the longest $m$-tuple match at each step. Given the dictionary $D_{i-1}$, mmLZ finds the words $b_i, \ldots, b_{i+m-1}$ that maximize the sum $|b_i| + \cdots + |b_{i+m-1}|$. However, the complexity of this algorithm seems prohibitive even for small values of $m$. In this work we establish a recursive way to find the longest $m$-tuple match. We prove the following theorem, which shows how to obtain a longest $(m+1)$-tuple match from the longest $m$-tuple match. It shows that a longest $(m+1)$-tuple match is the concatenation of the first $(m-1)$ words of the $m$-tuple match with the next longest double match.
Therefore, the longest $(m+1)$-tuple match can be found using the $m$-tuple match and a procedure to compute the longest double match.

Theorem. Let $A$ be a source alphabet, let $A^*$ be the set of all finite strings over $A$, and let $D \subset A^*$ be such that if $x \in D$ then all prefixes of $x$ belong to $D$. Let $u$ denote a one-sided infinite sequence. If $b_1^m$ is the longest $m$-tuple match in $u$ with respect to $D$, then there is a longest $(m+1)$-tuple match $\hat{b}_1^{m+1}$ such that $\hat{b}_i = b_i$ for all $i \in \{1, \ldots, m-1\}$.

We implemented the fast mmLZ, and the results show an improvement in compression of around 5% over LZW on the Canterbury Corpus [2], with little extra computational cost.

[1] W. A. Finamore, M. S. Pinho, and M. Craizer, “A multi-string match algorithm for lossless data compression,” Proc. of 7th International Colloquium on Numerical Analysis and Comp. Sci. with App., p. 39, Plovdiv, Bulgaria, Aug. 1998.
[2] R. Arnold and T. Bell, “A corpus for the evaluation of lossless compression algorithms,” Proc. of IEEE Data Comp. Conf., pp. 201-210, UT, March 1997.
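The recursion stated in the theorem can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy dictionary `D`, and the input `u` are hypothetical, the dictionary is held fixed rather than grown during parsing, and the input is finite (a tuple simply stops early if the input is exhausted). The sketch assumes `D` is prefix-closed and contains every single symbol of `u`. A brute-force search is included only as a reference to compare against.

```python
def longest_match(u, p, D):
    """Length of the longest word of D that is a prefix of u[p:] (0 at end of input)."""
    k = 0
    # D is assumed prefix-closed, so we can grow the match one symbol
    # at a time and stop at the first extension that is not in D.
    while p + k < len(u) and u[p:p + k + 1] in D:
        k += 1
    return k


def total(words):
    """Number of symbols covered by a list of words."""
    return sum(len(w) for w in words)


def longest_tuple_bruteforce(u, p, m, D):
    """Reference search (exponential): try every first word, recurse on the rest."""
    if m == 0 or p >= len(u):
        return []
    best = []
    for k in range(1, longest_match(u, p, D) + 1):
        cand = [u[p:p + k]] + longest_tuple_bruteforce(u, p + k, m - 1, D)
        if total(cand) > total(best):
            best = cand
    return best


def longest_double_match(u, p, D):
    """Longest 2-tuple match at p: try each first word, then one longest match."""
    best = []
    for k in range(1, longest_match(u, p, D) + 1):
        k2 = longest_match(u, p + k, D)
        cand = [u[p:p + k]] + ([u[p + k:p + k + k2]] if k2 else [])
        if total(cand) > total(best):
            best = cand
    return best


def longest_tuple_recursive(u, p, m, D):
    """Longest m-tuple match built as the theorem suggests:
    keep the first (m - 2) words of a longest (m - 1)-tuple match,
    then append the longest double match that follows them."""
    if m == 1:
        k = longest_match(u, p, D)
        return [u[p:p + k]] if k else []
    if m == 2:
        return longest_double_match(u, p, D)
    prev = longest_tuple_recursive(u, p, m - 1, D)
    head = prev[:m - 2]
    q = p + total(head)
    return head + longest_double_match(u, q, D)


# Hypothetical toy data: a prefix-closed dictionary and a short input.
D = {"a", "b", "c", "d", "ab", "bc", "bcd"}
u = "abcdabcdabcd"

if __name__ == "__main__":
    for m in range(1, 5):
        words = longest_tuple_recursive(u, 0, m, D)
        print(m, words, total(words))
```

On this toy input, four successive longest matches (`ab`, `c`, `d`, `ab`) cover 6 symbols, whereas the 4-tuple found by the recursion covers 8. Each recursive step costs one double-match search, which is where the saving over an exhaustive m-tuple search would come from.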
Similar resources
A New Approach to Detect Congestive Heart Failure Using Symbolic Dynamics Analysis of Electrocardiogram Signal
The aim of this study is to show that the measures derived from Electrocardiogram (ECG) signals many a time perform better than the same measures obtained from heart rate (HR) signals. A comparison was made to investigate how far the nonlinear symbolic dynamics approach helps to characterize the nonlinear properties of ECG signals and HR signals, and thereby discriminate between normal and cong...
Fast Relative Lempel-Ziv Self-index for Similar Sequences
Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatena...
Lempel-Ziv Jaccard Distance, an Effective Alternative to Ssdeep and Sdhash
Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to measure the similarity between binary byte sequences for malware classification. We propose and test LZJD’s effectiveness as a similarity digest hash for digital forensics. To do so we develop a high performance Java implementation with the same command-line arguments as sdhash, making it easy to integrate into exist...
Lempel-Ziv factorization: Simple, fast, practical
For decades the Lempel-Ziv (LZ77) factorization has been a cornerstone of data compression and string processing algorithms, and uses for it are still being uncovered. For example, LZ77 is central to several recent text indexing data structures designed to search highly repetitive collections. However, in many applications computation of the factorization remains a bottleneck in practice. In th...